Group Discussion: “Correlation is not Causation”
What do people mean when they say this?
What’s an example of correlation not being causation?
Can you think of any examples where causation isn’t correlation?
When can we be more sure that correlation = causation?
Prediction: goal is to get predicted outputs as close to the actual outputs as possible
Description: goal is to describe the data
Inference: goal is to learn about relationships between variables
🦟 Design an experiment to test whether bed nets reduce malaria infections
🦟 What’s a potential issue with your experiment?
from: https://www.r-causal.org/chapters/02-whole-game
Why would random assignment to mosquito netting help here?
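One way to see why: simulate a confounded world and compare a self-selected estimate to a randomized one. This is a sketch; the variable names, the effect sizes, and the wealth confounder are invented for illustration.

```r
set.seed(1)
n <- 10000
# hypothetical confounder: wealthier households both buy nets and face lower risk
wealth <- rnorm(n)
# self-selected net use depends on wealth; randomized net use does not
net_self <- rbinom(n, 1, plogis(wealth))
net_rand <- rbinom(n, 1, 0.5)
# true effect of nets on malaria risk (linear scale) is -1 in both worlds
risk_self <- -1 * net_self - 1.5 * wealth + rnorm(n)
risk_rand <- -1 * net_rand - 1.5 * wealth + rnorm(n)
coef(lm(risk_self ~ net_self))["net_self"] # biased away from -1: wealth leaks in
coef(lm(risk_rand ~ net_rand))["net_rand"] # close to -1: randomization breaks wealth -> net
```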
You have a headache. Jennifer gives you what looks like a pill. An hour later your headache goes away. How do you reason about whether or not the “pill” causes your headache to go away?
x: wealth, z: using Paxlovid, y: COVID severity
Which of these paths are causal for our question? Which are non-causal?
is this path the causal path we’re interested in?
if we can get an unbiased causal estimate of y ~ z using either set, what’s an argument for adjusting for x?
❓ Are there any paths that connect podcast to exam? Look for forks, colliders, or chains.
dagitty and ggdag
library(ggplot2)
my_first_dag <- ggdag::dagify(
x ~ y,
y ~ z + w,
q ~ z + x,
exposure = "x",
outcome = "z"
)
# ggdag(my_first_dag, layout = "time_ordered") +
# theme_dag() +
# theme(legend.position = "top")
ggdag_paths(my_first_dag, layout = "time_ordered") +
theme_dag() +
theme(legend.position = "top")
library(ggplot2)
my_first_dag <- ggdag::dagify(
x ~ y,
y ~ z + w,
q ~ z + x,
exposure = "x",
outcome = "z"
)
# ggdag(my_first_dag, layout = "time_ordered") +
# theme_dag() +
# theme(legend.position = "top")
# ggdag_paths(my_first_dag, layout = "time_ordered") +
# theme_dag() +
# theme(legend.position = "top")
ggdag_adjustment_set(my_first_dag) +
theme_dag() +
theme(legend.position = "top")
library(ggplot2)
my_first_dag <- ggdag::dagify(
z ~ x + y,
y ~ x,
exposure = "x",
outcome = "z"
)
ggdag(my_first_dag) + theme_dag()
{ y }
From: Causal Inference the Mixtape (Cunningham)
# from: https://mixtape.scunning.com/03-directed_acyclical_graphs
library(tidyverse)
tb <- tibble(
female = ifelse(runif(10000)>=0.5,1,0),
ability = rnorm(10000),
discrimination = female,
occupation = 1 + 2*ability + 0*female - 2*discrimination + rnorm(10000),
wage = 1 - 1*discrimination + 1*occupation + 2*ability + rnorm(10000)
)
# open path d -> o -> y
lm_1 <- lm(wage ~ female, tb)
# open path d -> o <- a -> y
lm_2 <- lm(wage ~ female + occupation, tb)
# woo! conditioning on ability closes the path opened by occupation
lm_3 <- lm(wage ~ female + occupation + ability, tb)
Call:
lm(formula = wage ~ female, data = tb)
Residuals:
Min 1Q Median 3Q Max
-18.726 -2.859 -0.055 2.916 14.153
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.95419 0.05926 32.98 <2e-16 ***
female -3.02009 0.08459 -35.70 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.229 on 9998 degrees of freedom
Multiple R-squared: 0.1131, Adjusted R-squared: 0.113
F-statistic: 1275 on 1 and 9998 DF, p-value: < 2.2e-16
Call:
lm(formula = wage ~ female + occupation, data = tb)
Residuals:
Min 1Q Median 3Q Max
-5.8292 -0.9117 -0.0010 0.9218 5.1408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.196475 0.019963 9.842 <2e-16 ***
female 0.628497 0.029882 21.033 <2e-16 ***
occupation 1.813026 0.006156 294.535 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.359 on 9997 degrees of freedom
Multiple R-squared: 0.9084, Adjusted R-squared: 0.9083
F-statistic: 4.954e+04 on 2 and 9997 DF, p-value: < 2.2e-16
Call:
lm(formula = wage ~ female + occupation + ability, data = tb)
Residuals:
Min 1Q Median 3Q Max
-4.5080 -0.6892 -0.0075 0.6808 3.8340
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.00354 0.01727 58.10 <2e-16 ***
female -1.00509 0.02856 -35.19 <2e-16 ***
occupation 1.00273 0.01004 99.84 <2e-16 ***
ability 2.01209 0.02222 90.57 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.008 on 9996 degrees of freedom
Multiple R-squared: 0.9497, Adjusted R-squared: 0.9496
F-statistic: 6.286e+04 on 3 and 9996 DF, p-value: < 2.2e-16
Draw a DAG that looks at the relationship between Parent’s grades in high school and their children’s grades in high school. Include as many factors as you can think of!
Build this DAG using ggdag or dagitty and look at the possible adjustment sets!
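One possible (deliberately incomplete) starting sketch, assuming a couple of invented factors such as SES and school quality; your DAG will likely differ:

```r
library(ggdag)
grades_dag <- dagify(
  child_grades ~ parent_grades + ses + school_quality,
  school_quality ~ ses,
  ses ~ parent_grades,
  exposure = "parent_grades",
  outcome = "child_grades"
)
# which variables (if any) must we adjust for to estimate the effect?
ggdag_adjustment_set(grades_dag) + theme_dag()
```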
Directed: edges have a direction from one node to another node; causes
Acyclic: there are no cycles that allow something to cause itself/feedback loops
Graphs: variables are connected via edges according to their causality
DAGs don’t specify parametric assumptions about the relationships such as:
how strong the relationships are
whether the relationships are linear/non-linear etc.
the DAG structure just shows that we think the relationship exists.
💡 Do you think this DAG is valid?
Fork: z is a confounder
Chain: z is a mediator
Collider: z is a collider
x is umbrella usage, y is sprinkler behavior, z is rain
x is annual income, y is cognitive test score, z is years of education
set.seed(540)
n <- 10000
years_ed <- rnorm(n, 12,4) %>% round()
cog_test <- -20 + years_ed*2.4 + rnorm(n,0,25)
annual_income <- 60000 + years_ed*1500 + rnorm(n,0,15000)
df <- data.frame(years_ed, cog_test, annual_income)
# build reg
lin_reg <- lm(annual_income ~ cog_test,
data = df) %>% summary()
lin_reg$coef %>% round(4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77062.1561 167.8604 459.0849 0
cog_test 84.3694 6.0006 14.0601 0
❓ Can we interpret the coefficient of cog_test causally? If we increase your cog_test scores, would we expect you to make more money?
x is annual income, y is cognitive test score, z is years of education
# build reg
lin_reg <- lm(annual_income ~ cog_test + years_ed,
data = df) %>% summary()
lin_reg$coef %>% round(4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59590.8788 489.6401 121.7034 0.0000
cog_test -0.6094 6.0516 -0.1007 0.9198
years_ed 1523.3561 40.4359 37.6733 0.0000
💡 “controlling for” years_ed gives us a better estimate of the effect of cog_test
For forks, we want to control for the effect of the confounder to get a causal estimate of y ~ x
Note: We’ll talk about more ways to do this next time.
x is hours studying, y is exam score, z is material comprehension
x is diet quality (nutrient density), y is mood, z is gut microbiome
set.seed(540)
n <- 100
diet_quality <- rnorm(n)
microbiome <- diet_quality*2.5 + rnorm(n)
mood <- microbiome*2 + rnorm(n, sd = 2)
df <- data.frame(diet_quality, microbiome, mood)
# build reg
lin_reg <- lm(mood ~ diet_quality,
data = df) %>% summary()
lin_reg$coef %>% round(4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1634 0.2966 -0.5509 0.583
diet_quality 5.1662 0.3146 16.4226 0.000
We expect the total effect to be \(\beta_{dq \to m} \times \beta_{m \to mood} = 2.5 \times 2 = 5\)
❓ Can we interpret the coefficient of diet_quality causally? If we improve your diet_quality, would we expect your mood to improve?
❓ What if we control for microbiome?
When we say we are “controlling for” a variable, we want the effect of x on y after accounting for z. Including z as a covariate in a regression is one way of controlling for it.
In this example: what is the effect of diet_quality once we already account for microbiome?
For chains, we do not want to control for the mediator if we want the total (here, indirect) causal estimate of y ~ x
For chains, we do want to control for the mediator if we want the direct causal estimate of y ~ x
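To answer the question above, we can re-fit the chain model with the mediator included (a sketch that regenerates the same simulated data from this example):

```r
library(tidyverse)
set.seed(540)
n <- 100
diet_quality <- rnorm(n)
microbiome <- diet_quality * 2.5 + rnorm(n)
mood <- microbiome * 2 + rnorm(n, sd = 2)
df <- data.frame(diet_quality, microbiome, mood)
# controlling for the mediator blocks the indirect path, so the
# diet_quality coefficient now estimates the direct effect (~0 in this DAG)
# while microbiome picks up its own effect (~2)
lm(mood ~ diet_quality + microbiome, data = df) %>%
  summary() %>%
  coef() %>%
  round(4)
```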
x: caffeine consumption, y: productivity, z: sleep quality.
❓ What is the direct effect of caffeine on productivity? What is the indirect effect of caffeine on productivity?
x is being funny, y is being hot, z is dating me
x is extracurriculars, y is good exam scores, z is getting into Chapman
set.seed(12938)
n <- 1000
logit <- function(x){
# note: despite the name, this is the inverse logit (logistic) function,
# which maps the linear predictor to a probability
return(1/(1 + exp(-x)))
}
extra_curriculars <- rnorm(n)
exam_scores <- rnorm(n)
acceptance <- rbinom(n, size = 1, prob = logit(1.6*extra_curriculars + exam_scores*2))
df <- data.frame(extra_curriculars, exam_scores, acceptance)
# build reg
lin_reg <- lm(exam_scores ~ extra_curriculars,
data = df) %>% summary()
lin_reg$coef %>% round(4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.0416 0.0319 -1.3028 0.1930
extra_curriculars 0.0076 0.0317 0.2395 0.8108
❓ Can we interpret the coefficient of extra_curriculars causally? If we increase your extra_curriculars, would we expect your exam scores to improve?
❓ What if we control for acceptance?
For colliders we do not want to condition on the collider in order to get a causal estimate of y ~ x
Collider bias can be a form of selection bias.
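The question of what happens when we condition on acceptance can be checked directly (a sketch regenerating the simulation from this example):

```r
set.seed(12938)
n <- 1000
logistic <- function(x) 1 / (1 + exp(-x)) # inverse logit
extra_curriculars <- rnorm(n)
exam_scores <- rnorm(n)
acceptance <- rbinom(n, size = 1, prob = logistic(1.6 * extra_curriculars + exam_scores * 2))
# exam_scores and extra_curriculars are independent by construction, but
# conditioning on the collider opens a non-causal path and induces a
# spurious negative association between them
summary(lm(exam_scores ~ extra_curriculars + acceptance))$coef
```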
Fork: z is a confounder (condition)
Chain: z is a mediator (maybe condition)
Collider: z is a collider (don’t condition)
Open: paths that transmit association between x and y (forks, chains)
Closed: paths that do not transmit association between x and y (colliders)
Example: a fork path x ← z → y is open but non-causal.
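The open/closed bookkeeping can also be done programmatically: dagitty’s paths() lists every path between two nodes and flags whether each is open (a toy DAG for illustration):

```r
library(dagitty)
# fork x <- z -> y (open, non-causal); chain x -> m -> y (open, causal);
# collider x -> c <- y (closed)
g <- dagitty("dag {
  x <- z -> y
  x -> m -> y
  x -> c <- y
}")
paths(g, from = "x", to = "y")
```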